STEIN’S LEMMA: A LARGE DEVIATIONS APPROACH


INTRODUCTION 

In  this  report,  we  prove  Stein’s  Lemma,  see  Ref.  1,  by  using  a  Large  Deviations  Principle.  This  idea 
was  first  proposed  in  Ref.  2;  we  provide  a  proof  that  is  more  general,  direct,  and  intuitive. 

Stein’s Lemma is formulated as follows. Let $\{X_n\}_1^\infty$ be a sequence of i.i.d. observations defined on some underlying probability triple $(\Omega, \mathcal{F}, R)$ and taking values in a measurable space $(E, \mathcal{E})$. We know that the probability measure $R$ is one of two probability measures $\mathbb{P}$ or $\mathbb{Q}$. For each $n = 1, 2, \ldots$, we form a Neyman-Pearson test to decide whether $R = \mathbb{P}$ or $R = \mathbb{Q}$ on the basis of $X_1, X_2, \ldots, X_n$ (clearly, we need $\mathbb{P} \neq \mathbb{Q}$ for this problem to be meaningful). Stein’s Lemma states that for a fixed power constraint, the size of the Neyman-Pearson tests decays at an exponential rate and provides a formula for this rate.

To place the problem in a rigorous setting, let $\{\mathcal{F}_n\}_1^\infty$ be the filtration of $\mathcal{F}$ generated by the observations;

$$\mathcal{F}_n := \sigma\{X_1, X_2, \ldots, X_n\}, \qquad n = 1, 2, \ldots \tag{1}$$

Let $0 < \varepsilon < 1$ be a predetermined constant, and take $n = 1, 2, \ldots$. For each set $D$ in $\mathcal{F}_n$, we can define a decision rule to select $\mathbb{P}$ or $\mathbb{Q}$ by choosing $\mathbb{P}$ if and only if $\omega \in D$, for any $\omega \in \Omega$ (the requirement that $D$ be in $\mathcal{F}_n$ is of course equivalent to the requirement that our decision be a function of the observations $X_1, X_2, \ldots, X_n$). To form the Neyman-Pearson test of power $\varepsilon$, we vary $D \in \mathcal{F}_n$ so as to minimize the size $\mathbb{Q}(D)$ (the false alarm rate in radar parlance) subject to the requirement that the power $\mathbb{P}(D)$ satisfy $\mathbb{P}(D) \geq 1 - \varepsilon$ (i.e., a lower bound on the detection probability). Let $e(n, \varepsilon)$ be this minimum, or more exactly, infimum; symbolically

$$e(n, \varepsilon) := \inf\{\mathbb{Q}(D) : D \in \mathcal{F}_n,\ \mathbb{P}(D) \geq 1 - \varepsilon\}, \qquad n = 1, 2, \ldots \tag{2}$$

Define $P$ (respectively $Q$) as the probability measure induced on $(E, \mathcal{E})$ by any one of the observation RVs $X_1, X_2, \ldots$ under the probability measure $\mathbb{P}$ (respectively $\mathbb{Q}$). Since the observations are identically distributed, it does not matter which $X_n$ we select to define $P$ and $Q$; we may choose $P = \mathbb{P} X_1^{-1}$ and $Q = \mathbb{Q} X_1^{-1}$. The result that we wish to prove can now be stated.

THE  MAIN  RESULT 

Theorem 1 (Stein). Assume that $P$ is absolutely continuous with respect to $Q$. Then

$$\lim_n \frac{1}{n} \log e(n, \varepsilon) = -D(P, Q), \tag{3}$$

where

$$D(P, Q) := \int_E \log \frac{dP}{dQ} \, dP, \tag{4}$$

the integral possibly being infinite.
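For a finite observation space, the integral in Eq. (4) reduces to a sum, which can be sketched as follows. The function name `kl_divergence` and the Bernoulli pair are illustrative assumptions, not part of the report; returning infinity when $P$ is not absolutely continuous with respect to $Q$ is a common convention that lies outside the hypothesis of Theorem 1.

```python
import math

def kl_divergence(p, q):
    """Return D(P, Q) = sum_x p(x) * log(p(x) / q(x)) for finite distributions.

    Convention (an assumption, outside Theorem 1): return math.inf when some
    outcome has p(x) > 0 but q(x) = 0, i.e. P is not absolutely continuous
    with respect to Q.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total

# Hypothetical example: P = Bernoulli(0.6), Q = Bernoulli(0.4),
# listed as [probability of 0, probability of 1].
p = [0.4, 0.6]
q = [0.6, 0.4]
d_pq = kl_divergence(p, q)  # equals 0.2 * log(1.5) for this pair
```

For this pair the sum is $0.4 \log(2/3) + 0.6 \log(3/2) = 0.2 \log 1.5$, so Theorem 1 would give a decay rate of roughly 0.081 per observation.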

If the observation space $E$ is finite, this result is the same as the one in Ref. 3, Corollary 2.2.2, and in Ref. 2. We note that $D(P, Q)$ is the Kullback-Leibler informational divergence of $P$ from $Q$; thus we know that $D(P, Q)$ is well defined; see Ref. 4 and Appendix A. Note also that if $P$ is not absolutely continuous with respect to $Q$, the Neyman-Pearson tests are trivial and $e(n, \varepsilon) = 0$ for $n$ large. Indeed, assume that $P$ is not absolutely continuous with respect to $Q$, so that there is a set $A$ in $\mathcal{E}$ such that $P(A) > 0$ but $Q(A) = 0$. For each $n = 1, 2, \ldots$, define the decision region $D_n \in \mathcal{F}_n$ as

$$D_n := \{X_i \in A \text{ for some } i = 1, 2, \ldots, n\} = \bigcup_{i=1}^n \{X_i \in A\}. \tag{5}$$

Then

$$\mathbb{P}(D_n) = 1 - \mathbb{P}(D_n^c) = 1 - \mathbb{P}\Big( \bigcap_{i=1}^n \{X_i \in A^c\} \Big) = 1 - P(A^c)^n, \tag{6}$$

so that $\lim_n \mathbb{P}(D_n) = 1$, and consequently $D_n$ satisfies the power constraint for $n$ large. But since

$$\mathbb{Q}(D_n) \leq \sum_{i=1}^n \mathbb{Q}\{X_i \in A\} = \sum_{i=1}^n Q(A) = 0, \tag{7}$$

we conclude that for $n$ large, $e(n, \varepsilon) \leq \mathbb{Q}(D_n) = 0$.
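The phrase "for $n$ large" can be made concrete: the region $D_n$ meets the power constraint as soon as $P(A^c)^n \leq \varepsilon$. A minimal sketch, with a hypothetical helper name and illustrative values $P(A) = 0.1$ and $\varepsilon = 0.05$ that are not taken from the report:

```python
def smallest_n_meeting_power(p_a, eps):
    """Smallest n with 1 - (1 - p_a)**n >= 1 - eps, where p_a = P(A) > 0.

    Equivalent to the first n for which (1 - p_a)**n <= eps.
    """
    if p_a >= 1.0:
        return 1
    n = 1
    while (1.0 - p_a) ** n > eps:
        n += 1
    return n

# Illustrative values (assumptions): P(A) = 0.1 and eps = 0.05.
n_needed = smallest_n_meeting_power(0.1, 0.05)
```

From that $n$ onward the test based on $D_n$ has size zero under $\mathbb{Q}$, which is why the problem is trivial without absolute continuity.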


MOTIVATION  FOR  THE  PROOF  OF  STEIN’S  LEMMA 


It is a well-known result that Neyman-Pearson tests are performed by comparing a log-likelihood ratio to a threshold; see Ref. 5, Thm. 5.5.2. For each $n = 1, 2, \ldots$, let $\mathbb{P}_n$ (respectively $\mathbb{Q}_n$) be the restriction of $\mathbb{P}$ (respectively $\mathbb{Q}$) to the $\sigma$-field $\mathcal{F}_n$. The absolute continuity requirement on $P$ and $Q$ implies that for each $n = 1, 2, \ldots$, $\mathbb{P}_n$ is absolutely continuous with respect to $\mathbb{Q}_n$, so that our log-likelihood ratio is $\log d\mathbb{P}_n / d\mathbb{Q}_n$.

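If we define

$$Y_n := \log \frac{dP}{dQ}(X_n), \qquad n = 1, 2, \ldots \tag{8}$$

then it is not difficult to verify that

$$\frac{d\mathbb{P}_n}{d\mathbb{Q}_n} = e^{S_n}, \qquad n = 1, 2, \ldots \tag{9}$$

where

$$S_n := \sum_{i=1}^n Y_i, \qquad n = 1, 2, \ldots \tag{10}$$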


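Note that the sequence $\{Y_n\}_1^\infty$ is an i.i.d. sequence and that we have suggestively written the log-likelihood ratio as a partial sum. If $R = \mathbb{P}$, then by the Strong Law of Large Numbers (SLLN),

$$\frac{1}{n} S_n \xrightarrow{\ \mathbb{P}\text{-a.s.}\ } \int_\Omega Y_1 \, d\mathbb{P} = \int_E \log \frac{dP}{dQ} \, dP = D(P, Q). \tag{11}$$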
Alternately, if $R = \mathbb{Q}$, we would then expect that if $Q$ is absolutely continuous with respect to $P$,

$$\frac{1}{n} S_n \xrightarrow{\ \mathbb{Q}\text{-a.s.}\ } \int_\Omega Y_1 \, d\mathbb{Q} = -D(Q, P). \tag{12}$$


(See  Appendix  B  for  the  SLLNs  that  are  required  in  Eqs.  (11)  and  (12)  if  the  integrals  are  infinite.) 
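The two almost sure limits in Eqs. (11) and (12) can be illustrated with a small simulation. This is only a sketch under assumed values: Bernoulli observations with parameters 0.6 under $\mathbb{P}$ and 0.4 under $\mathbb{Q}$, a fixed seed, and a hypothetical helper name.

```python
import math
import random

def mean_log_likelihood_ratio(p, q, sample_from, n, rng):
    """Return S_n / n with Y_i = log(dP/dQ)(X_i), X_i i.i.d. Bernoulli(sample_from)."""
    y_one = math.log(p / q)                # value of Y_i when X_i = 1
    y_zero = math.log((1 - p) / (1 - q))   # value of Y_i when X_i = 0
    total = 0.0
    for _ in range(n):
        total += y_one if rng.random() < sample_from else y_zero
    return total / n

# Illustrative values (assumptions, not from the report).
p, q, n = 0.6, 0.4, 20000
rng = random.Random(0)
under_p = mean_log_likelihood_ratio(p, q, p, n, rng)   # expected near  D(P, Q)
under_q = mean_log_likelihood_ratio(p, q, q, n, rng)   # expected near -D(Q, P)

d_pq = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
d_qp = q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))
```

With these parameters both divergences equal $0.2 \log 1.5$, so the sample averages should settle near $+0.081$ under $\mathbb{P}$ and $-0.081$ under $\mathbb{Q}$; the gap between the two limits is what a threshold test on $S_n/n$ exploits.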

From Eqs. (11) and (12), our hypothesis tests should reflect the fact that $S_n/n$ has different almost sure limits under the different probability measures $\mathbb{P}$ and $\mathbb{Q}$. If we define our decision regions so as to decide that $R = \mathbb{P}$ if $S_n/n$ is near $D(P, Q)$, then a high rate of detection and a low false alarm rate should result for large $n$. (Note that Eqs. (11) and (12) explain a technical difficulty. If $\mathbb{P}$ agrees with $\mathbb{Q}$ on $\mathcal{F}_\infty := \bigvee_{n=1}^\infty \mathcal{F}_n$, then we expect not to be able to distinguish between $R = \mathbb{P}$ and $R = \mathbb{Q}$ from the observations. This is reflected in the easily verified fact that $P = Q$ if and only if $\mathbb{P}$ and $\mathbb{Q}$ coincide on $\mathcal{F}_\infty$, in which case $D(P, Q) = D(Q, P) = 0$ and $S_n/n$ tends almost surely to 0 under both $\mathbb{P}$ and $\mathbb{Q}$.) Since $\{Y_n\}_1^\infty$ is i.i.d., we can use Cramér’s theorem from the field of Large Deviations, see Ref. 6, Theorem 3.8 and Ref. 7, Theorem 3.1, to describe the rate at which $S_n/n$ tends to its limit under $\mathbb{P}$ and $\mathbb{Q}$. The reasoning behind the following arguments is then clear.


PROOF OF THEOREM 1

A  Large  Deviations  Principle 
Let  us  temporarily  assume  that 

(a)  the  probability  measure  Q  is  absolutely  continuous  with  respect  to  the  probability  measure  P, 

(b) $D(P, Q) < \infty$ and $D(Q, P) < \infty$, and

(c) the moment generating function $M$ of $Y_1$ under $\mathbb{Q}$, i.e., $M(\theta) := \int_\Omega e^{\theta Y_1} \, d\mathbb{Q}$, is finite for all $\theta$ in $\mathbb{R}$.

Under these assumptions, we may directly verify the upper bound

$$\limsup_n \frac{1}{n} \log e(n, \varepsilon) \leq -D(P, Q) \tag{13}$$

by invoking Cramér’s Theorem. Fix $\delta > 0$ and set $F_\delta := [D(P, Q) - \delta, \infty)$. From Eq. (11) we know that if $R = \mathbb{P}$, then $\lim_n S_n/n = D(P, Q)$ $\mathbb{P}$-a.s., and this limit lies in the interior of $F_\delta$. Since almost sure convergence is stronger than convergence in probability, it is immediate that

$$\lim_n \mathbb{P}(\{S_n/n \in F_\delta\}) = 1, \tag{14}$$

so for large $n$, the decision regions given by

$$D_n := \{S_n/n \in F_\delta\}, \qquad n = 1, 2, \ldots \tag{15}$$

satisfy the power constraint.

We can now apply Cramér’s Theorem to verify that

$$\limsup_n \frac{1}{n} \log \mathbb{Q}(D_n) \leq -\inf_{x \in F_\delta} I(x), \tag{16}$$
where $I$ is the Legendre-Fenchel transform, see Ref. 8, Chapter 6, of $\log M(\cdot)$;

$$I(x) := \sup_{\theta \in \mathbb{R}} \left( \theta x - \log M(\theta) \right), \qquad x \in \mathbb{R}. \tag{17}$$

From Ref. 6, Lemma 3.3, we know that $x \mapsto I(x)$ is nondecreasing for $x \geq \int_\Omega Y_1 \, d\mathbb{Q} = -D(Q, P)$, the integral being well defined under assumption (b). Thus

$$\inf_{x \in F_\delta} I(x) = I(D(P, Q) - \delta). \tag{18}$$

But

$$I(D(P, Q) - \delta) \geq (1)(D(P, Q) - \delta) - \log M(1) = D(P, Q) - \delta, \tag{19}$$

since

$$M(1) = \int_E e^{\log \frac{dP}{dQ}} \, dQ = P(E) = 1. \tag{20}$$

Combining Eqs. (16) through (19), we have that

$$\limsup_n \frac{1}{n} \log \mathbb{Q}(D_n) \leq -D(P, Q) + \delta. \tag{21}$$

In view of Eq. (14), we then have that

$$\limsup_n \frac{1}{n} \log e(n, \varepsilon) \leq \limsup_n \frac{1}{n} \log \mathbb{Q}(D_n) \leq -D(P, Q) + \delta. \tag{22}$$

Since $\delta > 0$ was arbitrary, Eq. (13) is established.

An inspection of the proof of Cramér’s theorem reveals how to prove Theorem 1 when assumptions (a) through (c) are not enforced.

Case 1: $D(P, Q) < \infty$

Upper Bound: Fix $\delta > 0$ and again set $F_\delta := [D(P, Q) - \delta, \infty)$ and

$$D_n := \{S_n/n \in F_\delta\}, \qquad n = 1, 2, \ldots \tag{23}$$


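As in the above arguments, we know that for large $n$, $D_n$ satisfies the power constraint. Following the arguments of Ref. 6, Lemma 3.4, we argue that for each $n = 1, 2, \ldots$

$$\begin{aligned}
\mathbb{Q}(D_n) &= \int_{\{S_n \geq n(D(P, Q) - \delta)\}} d\mathbb{Q} \\
&\leq \int_\Omega \exp\left[ S_n - n(D(P, Q) - \delta) \right] d\mathbb{Q} \\
&= \exp\left[ -n(D(P, Q) - \delta) \right] \int_\Omega e^{S_n} \, d\mathbb{Q} \\
&= \exp\left[ -n(D(P, Q) - \delta) \right] (1),
\end{aligned} \tag{24}$$

where the last equality follows from Eq. (9), since $\int_\Omega e^{S_n} \, d\mathbb{Q} = \mathbb{P}_n(\Omega) = 1$, and consequently,

$$\limsup_n \frac{1}{n} \log \mathbb{Q}(D_n) \leq -D(P, Q) + \delta. \tag{25}$$

As above, this is sufficient to prove the upper bound Eq. (13) since $\delta > 0$ was arbitrary.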
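Both routes to the upper bound can be checked numerically in a simple case. The sketch below assumes Bernoulli observations with parameters 0.6 under $\mathbb{P}$ and 0.4 under $\mathbb{Q}$ (illustrative values, not from the report): it computes $\mathbb{Q}(D_n)$ exactly from the binomial distribution and compares it with the exponential bound, and it evaluates the rate function of Eq. (17) by a ternary search over $\theta$ restricted to an assumed interval $[-50, 50]$.

```python
import math

def kl_bernoulli(p, q):
    """D(P, Q) for P = Bernoulli(p), Q = Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def q_prob_of_region(p, q, n, delta):
    """Exact Q(D_n) for D_n = {S_n / n >= D(P, Q) - delta}."""
    y_one = math.log(p / q)
    y_zero = math.log((1 - p) / (1 - q))
    threshold = n * (kl_bernoulli(p, q) - delta)
    prob = 0.0
    for k in range(n + 1):                     # k = number of ones in X_1..X_n
        s_n = k * y_one + (n - k) * y_zero
        if s_n >= threshold:
            prob += math.comb(n, k) * q**k * (1 - q)**(n - k)
    return prob

def rate_function(x, q, y_one, y_zero, lo=-50.0, hi=50.0):
    """Approximate I(x) = sup_theta (theta * x - log M(theta)).

    The objective is concave in theta, so ternary search applies; the
    search interval [lo, hi] is an assumption made for this sketch.
    """
    def objective(theta):
        m = q * math.exp(theta * y_one) + (1 - q) * math.exp(theta * y_zero)
        return theta * x - math.log(m)
    for _ in range(200):
        a = lo + (hi - lo) / 3
        b = hi - (hi - lo) / 3
        if objective(a) < objective(b):
            lo = a
        else:
            hi = b
    return objective((lo + hi) / 2)

p, q, delta = 0.6, 0.4, 0.05
d_pq = kl_bernoulli(p, q)
y_one, y_zero = math.log(p / q), math.log((1 - p) / (1 - q))

# Direct exponential bound: Q(D_n) <= exp(-n (D(P, Q) - delta)).
bounds_hold = all(
    q_prob_of_region(p, q, n, delta) <= math.exp(-n * (d_pq - delta))
    for n in (1, 10, 50, 200)
)

# Rate-function inequality of Eq. (19): I(D(P, Q) - delta) >= D(P, Q) - delta.
i_val = rate_function(d_pq - delta, q, y_one, y_zero)
```

The exponential bound is not tight for finite $n$; only its exponent matters for the limit superior.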
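Lower Bound: We next prove that

$$\liminf_n \frac{1}{n} \log e(n, \varepsilon) \geq -D(P, Q). \tag{26}$$

Our proof of Eq. (26) is essentially the same as that in Ref. 2. Take $\delta > 0$. Then for each positive integer $n$, we can find a set $U_n$ in $\mathcal{F}_n$ so that

$$\mathbb{P}(U_n) \geq 1 - \varepsilon \tag{27}$$

and

$$\mathbb{Q}(U_n) \leq e(n, \varepsilon) e^{n\delta/2}. \tag{28}$$

Define $F_\delta := (-\infty, D(P, Q) + \delta/2]$ and set

$$D_n := \{S_n/n \in F_\delta\}, \qquad n = 1, 2, \ldots \tag{29}$$

As in the proof of the upper bound, the SLLN ensures that $\lim_n \mathbb{P}(D_n) = 1$, so necessarily

$$\liminf_n \mathbb{P}(U_n \cap D_n) \geq 1 - \varepsilon. \tag{30}$$

For each $n = 1, 2, \ldots$

$$\begin{aligned}
\mathbb{P}(U_n \cap D_n) &= \mathbb{P}_n(U_n \cap D_n) \\
&= \int_{U_n \cap D_n} e^{S_n} \, d\mathbb{Q} \\
&\leq \int_{U_n \cap D_n} \exp\left[ n(D(P, Q) + \delta/2) \right] d\mathbb{Q} \\
&\leq \exp\left[ n(D(P, Q) + \delta/2) \right] \mathbb{Q}(U_n \cap D_n),
\end{aligned} \tag{31}$$

where we have used Eq. (9) and the fact that $S_n \leq n(D(P, Q) + \delta/2)$ on $D_n$, which is obvious from Eq. (29). Thus, upon combining Eq. (28) and Eq. (31), we have

$$\begin{aligned}
e(n, \varepsilon) &\geq \mathbb{Q}(U_n) e^{-n\delta/2} \\
&\geq \mathbb{Q}(U_n \cap D_n) e^{-n\delta/2} \\
&\geq \mathbb{P}(U_n \cap D_n) \exp\left[ -n(D(P, Q) + \delta) \right],
\end{aligned} \tag{32}$$

so in view of Eq. (30),

$$\liminf_n \frac{1}{n} \log e(n, \varepsilon) \geq -D(P, Q) - \delta; \tag{33}$$

since $\delta > 0$ was arbitrary, Eq. (26) is true.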

Case 2: $D(P, Q) = \infty$

We wish to prove that

$$\lim_n \frac{1}{n} \log e(n, \varepsilon) = -\infty. \tag{34}$$

From the SLLN found in Appendix B, we know that $\lim_n S_n/n = \infty$ $\mathbb{P}$-a.s. if $R = \mathbb{P}$. Fix a positive number $B$, define $F_B := [B, \infty)$, and for each $n = 1, 2, \ldots$, let the decision region $D_n$ be given by

$$D_n := \{S_n/n \in F_B\}. \tag{35}$$

Then $\lim_n \mathbb{P}(D_n) = 1$ as in Case 1, but for each $n = 1, 2, \ldots$,

$$\begin{aligned}
\mathbb{Q}(D_n) &= \int_{\{S_n \geq nB\}} d\mathbb{Q} \\
&\leq \int_\Omega \exp\left[ S_n - nB \right] d\mathbb{Q} \\
&= e^{-nB} \int_\Omega e^{S_n} \, d\mathbb{Q} \\
&= e^{-nB} (1),
\end{aligned} \tag{36}$$

so that

$$\limsup_n \frac{1}{n} \log \mathbb{Q}(D_n) \leq -B. \tag{37}$$

Hence, in a manner analogous to Eq. (22),

$$\limsup_n \frac{1}{n} \log e(n, \varepsilon) \leq \limsup_n \frac{1}{n} \log \mathbb{Q}(D_n) \leq -B, \tag{38}$$

and since $B$ was an arbitrary positive number,

$$\limsup_n \frac{1}{n} \log e(n, \varepsilon) = -\infty, \tag{39}$$

which was to be proved, since the limit inferior cannot exceed the limit superior.

The proof of Theorem 1 is complete.


CLOSURE 

Note in our proof of Stein’s Lemma that we did not formulate the Neyman-Pearson tests. The Strong Law of Large Numbers and Eqs. (11) and (12) led us to a series of tests that bounded the true Neyman-Pearson tests. The asymptotic behavior of these tests was found by using Large Deviations arguments, and Stein’s Lemma resulted.

In the Neyman-Pearson tests studied here, we minimized the false alarm rate subject to a lower bound on the probability of detection. The more common formulation is to maximize the probability of detection subject to an upper bound on the false alarm rate. By reversing the roles of $\mathbb{P}$ and $\mathbb{Q}$, we see that the two problems are equivalent. Define

$$\gamma(n, \varepsilon) := \sup\{\mathbb{Q}(D) : D \in \mathcal{F}_n,\ \mathbb{P}(D) \leq \varepsilon\}, \qquad n = 1, 2, \ldots \tag{40}$$

Then $\gamma(n, \varepsilon)$ corresponds to maximizing the probability of detection $\mathbb{Q}(D)$ (the power) subject to the constraint that the false alarm rate $\mathbb{P}(D)$ satisfy $\mathbb{P}(D) \leq \varepsilon$ (an upper bound on the size). Since

$$\gamma(n, \varepsilon) = 1 - e(n, \varepsilon), \qquad n = 1, 2, \ldots \tag{41}$$

an alternate way of stating the result of Theorem 1 is

$$\lim_n \frac{1}{n} \log(1 - \gamma(n, \varepsilon)) = -D(P, Q). \tag{42}$$
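Eq. (41) follows from passing to complements: $D$ satisfies $\mathbb{P}(D) \geq 1 - \varepsilon$ exactly when $D^c$ satisfies $\mathbb{P}(D^c) \leq \varepsilon$, and $\mathbb{Q}(D^c) = 1 - \mathbb{Q}(D)$. For a very small problem this can be confirmed by enumerating every decision region. The sketch below assumes three Bernoulli observations with parameters 0.6 under $\mathbb{P}$ and 0.4 under $\mathbb{Q}$ and $\varepsilon = 0.25$; these values and the function name are illustrative only.

```python
from itertools import product

def brute_force_e_and_gamma(p, q, n, eps):
    """Compute e(n, eps) and gamma(n, eps) by enumerating all regions D.

    A region D is any set of length-n binary outcome vectors, encoded as a
    bit mask over the 2**n outcomes. Only feasible for very small n.
    """
    outcomes = list(product((0, 1), repeat=n))

    def prob(outcome, r):
        # Probability of one outcome vector under i.i.d. Bernoulli(r).
        ones = sum(outcome)
        return r**ones * (1 - r)**(n - ones)

    p_atoms = [prob(o, p) for o in outcomes]
    q_atoms = [prob(o, q) for o in outcomes]
    e_val, gamma_val = 1.0, 0.0
    for mask in range(1 << len(outcomes)):
        p_d = sum(a for i, a in enumerate(p_atoms) if (mask >> i) & 1)
        q_d = sum(a for i, a in enumerate(q_atoms) if (mask >> i) & 1)
        if p_d >= 1 - eps:          # feasible for the infimum in Eq. (2)
            e_val = min(e_val, q_d)
        if p_d <= eps:              # feasible for the supremum in Eq. (40)
            gamma_val = max(gamma_val, q_d)
    return e_val, gamma_val

e_val, gamma_val = brute_force_e_and_gamma(0.6, 0.4, 3, 0.25)
```

The enumeration visits $2^{2^n}$ regions, so it serves only as a sanity check on the identity, not as a way to compute $e(n, \varepsilon)$ in general.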
Note also that we can in fact relax the assumption that the observations are i.i.d. If $\mathbb{P}_n$ is absolutely continuous with respect to $\mathbb{Q}_n$ for each $n = 1, 2, \ldots$, and if there is a constant $M$, possibly infinite, such that

$$M = \mathbb{P}\text{-}\lim_n \frac{1}{n} \log \frac{d\mathbb{P}_n}{d\mathbb{Q}_n}, \tag{43}$$

then it is easy to verify, using the above arguments, that

$$\lim_n \frac{1}{n} \log e(n, \varepsilon) = -M. \tag{44}$$

We shall leave the proof of this extension to the interested reader.
Appendix A


THE  DIVERGENCE  INTEGRAL 


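The proof that the divergence integral is well defined is difficult to find in the literature; here we provide a simple proof. We show that

$$\int_E \left( \log \frac{dP}{dQ} \right)^- dP < \infty, \tag{A1}$$

where $x^- := \max\{-x, 0\}$. The negative part of the integrand in Eq. (4) is then $P$-integrable, so the integral is well defined, though possibly infinite. For convenience, define $X := dP/dQ$. Now

$$\begin{aligned}
\int_E (\log X)^- \, dP &= \int_{\{X < 1\}} -\log X \, dP \\
&= \int_{\{X < 1\}} -X \log X \, dQ \\
&= \int_{\{X < 1\}} \varphi(X) \, dQ,
\end{aligned} \tag{A2}$$

where $\varphi(t) := -t \log t$ for $t > 0$ and $\varphi(0) = 0$. If $Q\{X < 1\} = 0$, Eq. (A1) follows immediately from Eq. (A2), so assume that $Q\{X < 1\} > 0$. By differentiating twice, we see that $\varphi$ is concave on $[0, \infty)$, so by Jensen’s inequality,

$$\int_{\{X < 1\}} \varphi(X) \, dQ \leq Q\{X < 1\} \, \varphi\left( \frac{1}{Q\{X < 1\}} \int_{\{X < 1\}} X \, dQ \right), \tag{A3}$$

which is clearly finite; returning to Eq. (A2), we see that Eq. (A1) is true when $Q\{X < 1\} > 0$.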
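The Jensen bound of Eq. (A3) can be checked on a small discrete example. In the sketch below the distributions and function names are assumptions chosen for illustration; it uses the identity $\int_{\{X < 1\}} X \, dQ = P\{X < 1\}$ to evaluate the right-hand side.

```python
import math

def negative_part_integral(p, q):
    """Integral of (log dP/dQ)^- dP for finite P << Q.

    Only outcomes with p_i < q_i (that is, X < 1) contribute; outcomes
    with p_i = 0 contribute phi(0) = 0 and are skipped.
    """
    return sum(-pi * math.log(pi / qi) for pi, qi in zip(p, q) if 0 < pi < qi)

def jensen_bound(p, q):
    """Right side of Eq. (A3): Q{X<1} * phi(P{X<1} / Q{X<1}), phi(t) = -t log t."""
    q_mass = sum(qi for pi, qi in zip(p, q) if pi < qi)
    p_mass = sum(pi for pi, qi in zip(p, q) if pi < qi)
    if q_mass == 0.0 or p_mass == 0.0:
        return 0.0  # phi(0) = 0, and an empty region contributes nothing
    t = p_mass / q_mass
    return q_mass * (-t * math.log(t))

# Hypothetical three-point distributions with P << Q.
p = [0.7, 0.1, 0.2]
q = [0.2, 0.3, 0.5]
lhs = negative_part_integral(p, q)
rhs = jensen_bound(p, q)
```

Here $\{X < 1\}$ consists of the last two outcomes, and the bound is close to the exact value, which reflects how mild the concavity correction is when $X$ varies little on that set.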




Appendix  B 

THE  STRONG  LAW  OF  LARGE  NUMBERS 

The  proof  of  Eqs.  (11)  and  (12)  requires  the  following  formulation  of  the  Strong  Law  of  Large  Numbers. 

Proposition. Let $\{X_n\}_1^\infty$ be a sequence of i.i.d. RVs defined on an underlying probability triple $(\Omega, \mathcal{A}, P)$. Suppose that $E[X_1]$ is well defined and $-\infty < E[X_1] \leq \infty$. Then

$$\frac{1}{n} \sum_{i=1}^n X_i \xrightarrow{\ P\text{-a.s.}\ } E[X_1]. \tag{B1}$$

Proof. If $E[X_1] < \infty$, we may use Ref. 9, Theorem 2.3.1 to verify Eq. (B1); assume that $E[X_1] = \infty$. Take any positive constant $B$, and define

$$X_n^B := \min\{X_n, B\}, \qquad n = 1, 2, \ldots \tag{B2}$$

Clearly the $\{X_n^B\}_1^\infty$ are i.i.d. and $P$-integrable, so Ref. 9, Theorem 2.3.1 again applies and we conclude that $P$-a.s.

$$\lim_n \frac{1}{n} \sum_{i=1}^n X_i^B = E[X_1^B]. \tag{B3}$$

But $P$-a.s.

$$\liminf_n \frac{1}{n} \sum_{i=1}^n X_i \geq \liminf_n \frac{1}{n} \sum_{i=1}^n X_i^B \geq E[X_1^B]. \tag{B4}$$

Since $B$ was an arbitrary positive constant, we let $B$ tend to infinity, and by the Monotone Convergence Theorem, we then have from Eq. (B4) that $P$-a.s.

$$\liminf_n \frac{1}{n} \sum_{i=1}^n X_i = \infty, \tag{B5}$$

which is the result we seek when $E[X_1] = \infty$.